Lab 08a: $k$ nearest neighbours classification

Introduction

This lab focuses on SMS message spam detection using $k$ nearest neighbours classification. It's a direct counterpart to the rule-based spam detection from Lab 05 and the decision tree models from Lab 07a. At the end of the lab, you should be able to use scikit-learn to:

  • Create a $k$ nearest neighbours classification model.
  • Use the model to predict new values.
  • Measure the accuracy of the model.

Getting started

Let's start by importing the packages we'll need. This week, we're going to use the neighbors subpackage from scikit-learn to build $k$ nearest neighbours models. We'll also use the dummy subpackage to build a baseline model against which we can gauge how good our final model is.


In [ ]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

Next, let's load the data. Write the path to your sms.csv file in the cell below:


In [ ]:
data_file = 'data/sms.csv'

Execute the cell below to load the CSV data into a pandas data frame with the columns label and message.

Note: This week, the CSV file is not comma separated, but instead tab separated. We can tell pandas about the different format using the sep argument, as shown in the cell below. For more information, see the read_csv documentation.


In [ ]:
sms = pd.read_csv(data_file, sep='\t', header=None, names=['label', 'message'])
sms.head()
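
Before moving on, it's worth getting a feel for how the labels are balanced, as this will help to interpret the precision and recall figures later. For example, a quick check might look like this:


In [ ]:
sms['label'].value_counts()  # Number of ham and spam messages in the data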

Next, let's select our feature ($X$) and target ($y$) variables from the data. Usually, we would use all of the available data but, for speed ($k$ nearest neighbours can be CPU intensive), let's just select a random sample. We can do this using the sample method in pandas, as follows:


In [ ]:
sample = sms.sample(frac=0.25, random_state=0)  # Randomly subsample a quarter of the available data

X = sample['message']
y = sample['label']
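
If you'd like to confirm the size of the subsample, you can compare it to the full data set, for example:


In [ ]:
len(sample), len(sms)  # The sample should contain roughly a quarter of the rows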

$k$ nearest neighbours

Let's build a nearest neighbours classification model of the SMS message data. scikit-learn supports nearest neighbours functionality via the neighbors subpackage. This subpackage supports both nearest neighbours regression and classification. We can use the KNeighborsClassifier class to build our model.

KNeighborsClassifier accepts a number of different hyperparameters, and the model we build may be more or less accurate depending on their values. We can get a list of these hyperparameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:


In [ ]:
KNeighborsClassifier().get_params()

You can find a more detailed description of each parameter in the scikit-learn documentation.

Let's use a grid search to select the optimal nearest neighbours classification model from a set of candidates. First, we need to build a pipeline, just as we did last week. Next, we define the parameter grid. Finally, we use a grid search with an inner cross validation to select the best model, and an outer cross validation to measure the accuracy of the selected model.

Note: When using grid search with pipelines, we have to adjust the names of our hyperparameters, prepending the name of the pipeline step they apply to (with make_pipeline, this is just the class name in lowercase) followed by two underscores. This is so that scikit-learn can tell which hyperparameters apply to which step. Below, we prepend the string 'kneighborsclassifier__' to each hyperparameter name because they all apply to the KNeighborsClassifier step.
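
If you're unsure what the prefixed names should be, you can list a pipeline's hyperparameters with get_params, just as we did for the classifier above. For example, a quick check using the same components as below:


In [ ]:
sorted(make_pipeline(TfidfVectorizer(), KNeighborsClassifier()).get_params())  # Includes names such as 'kneighborsclassifier__n_neighbors'

With that in mind, let's build the pipeline and run the search: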


In [ ]:
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    KNeighborsClassifier()
)

# Build models for different values of n_neighbors (k), distance metric and weight scheme
parameters = {
    'kneighborsclassifier__n_neighbors': [2, 5, 10],
    'kneighborsclassifier__metric': ['manhattan', 'euclidean'],
    'kneighborsclassifier__weights': ['uniform', 'distance']
}

# Use inner CV to select the best model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # K = 5

clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1)  # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)

print(classification_report(y, y_pred))  # Print the classification report

The model is much more accurate than the rule-based model from Lab 05, but not as accurate as the random forest model from Lab 07a. Specifically, we can say that:

  • 92% of the messages we labelled as ham were actually ham (precision for ham = 0.92).
  • 100% of the messages we labelled as spam were actually spam (precision for spam = 1.00).
  • We labelled every actual ham as ham (recall for ham = 1.00).
  • We labelled 44% of spam as spam (recall for spam = 0.44).

While no ham was misclassified as spam, we only managed to filter out 44% of spam messages (fewer than half).
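
As mentioned in the introduction, the dummy subpackage can be used to build a baseline model against which to gauge these results. The cell below is one possible sketch: a baseline that always predicts the most frequent class, evaluated with the same outer cross validation as above.


In [ ]:
from sklearn.dummy import DummyClassifier

# A baseline that ignores the message text and always predicts the majority class ('ham')
baseline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    DummyClassifier(strategy='most_frequent')
)

y_base = cross_val_predict(baseline, X, y, cv=outer_cv)
print(classification_report(y, y_base))  # Any useful model should beat this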

As before, we can check the parameters of the selected model using the best_params_ attribute of the fitted grid search:


In [ ]:
clf.best_params_
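
We can also inspect the mean inner cross validation accuracy of the selected model, which the fitted grid search stores in the best_score_ attribute, for example:


In [ ]:
clf.best_score_  # Mean cross-validated accuracy of the best hyperparameter combination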